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Rejoinder: Demystifying Double 
Robustness: A Comparison of Alternative 
Strategies for Estimating a Population 
Mean from Incomplete Data 



Joseph D. Y. Kang and Joseph L. Schafer 

1. CLARIFYING OUR POSITION ON DOUBLY 
ROBUST ESTIMATORS 

We are grateful to the editors for eliciting com- 
ments from some of the most prominent researchers 
in this exciting and rapidly developing field. After 
we drafted our article, a number of important works 
on DR estimators appeared, including Tan's (2006) 
article on causal inference, the monograph by Tsiatis 
(2006) and the recent articles and technical reports 
cited by Robins, Sued, Lei-Gomez and Rotnitzky. 
The discussants' insightful remarks highlight these 
recent developments and bring us up to date. 

Our purpose in writing this article was to pro- 
vide unfamiliar readers with gentle introduction to 
DR estimators without the language of influence 
functions, using only simple concepts from regres- 
sion analysis and survey inference. We wanted to 
show that DR estimators come in many different 
flavors. And, without minimizing the importance of 
the literature spawned by Robins, Rotnitzky and 
Zhao (1994), we wanted to draw attention to some 
older related methods from model-assisted survey 
sampling which arrive at a similar position from the 
opposite direction. 
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Despite the good performance oi fioLS in our sim- 
ulated example, we have not and would not argue 
that it be used routinely and uncritically. The pit- 
falls of relying solely on outcome regression or y- 
modeling have been well documented for causal in- 
ference, where the rates of missing information are 
high and the impact of selection bias can be dra- 
matic (e.g., Rubin, 2001). Nor do we wish to cast 
clouds of suspicion over all DR estimators in all cir- 
cumstances. In many situations, they do work well. 
On the other hand, we still believe that procedures 
motivated by parametric y-models, when carefully 
implemented, remain a viable option and should not 
be categorically dismissed. 

Under ignorability, the propensities vTj = P{ti = 
l|xi),i = l,...,n, play no role in likelihood-based or 
Bayesian inference about ^ under a given y-model. 
If we had absolute faith in one parametric form for 
P{yi\xi)-, then we could discard all information be- 
yond the sufficient statistics for that model. But the 
propensities carry information that helps us evalu- 
ate the quality of the y-model, and we ignore this 
extra information at our peril, because no model is 
above criticism. No sensible statistician would ar- 
gue that propensities should not be examined. But 
reasonable persons may differ over what role the 
propensities should play in formulating an estima- 
tor. Those who favor a semiparametric approach de- 
vise influence functions that combine inverse-propen- 
sity weights with regression predictions for y. Para- 
metric modelers, on the other hand, may well argue 
that if the propensities reveal weaknesses in the y- 
model, then that model should be revised and cor- 
rected. The latter view has been expressed by El- 
liott and Little (2000) in the context of survey esti- 
mation, where the selection probabilities are known, 
but parallels to uncontrolled nonresponse and causal 
inference are obvious. 
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We believe that propensities are useful for model 
diagnosis and estimation, but we are still not con- 
vinced that they need to enter an influence func- 
tion as inverse-propensity weights. The strength of 
weighting is that, if done properly, it protects an es- 
timate from bias regardless of how y is distributed. 
But this strength can also be a weakness, because 
such a high level of protection is not always war- 
ranted. If the propensities are unrelated to the lin- 
ear predictors from a good y-model, then weighting 
may be superfluous. If the propensities are poorly 
estimated or extreme, then combining weights with 
the regression predictions may do more harm than 
good. And if the propensities do reveal weaknesses 
in the y-model, inverse-propensity weights are not 
the only way to correct them. 

2. RESPONSE TO TSIATIS AND DAVIDIAN 

In their illuminating discussion, Tsiatis and Da- 
vidian demonstrate that a wide variety of estima- 
tors for /_f can be expressed as the solution to an 
estimating equation based on an influence function. 
(One possible exception is the class of estimators 
based on propensity-score matching, which we have 
not discussed.) Influence functions present interest- 
ing results on semiparametric efficiency, but we find 
them appealing for other reasons as well. First, they 
show us how to compute a standard error for what- 
ever estimator we choose. Second, they generalize 
nicely to finite-population sample surveys with com- 
plex designs. Regression techniques for complex sur- 
veys, as implemented in software packages like SU- 
DAAN (Shah, Barnwell and Biler, 1997), are based 
on weighted influence functions, so any of the es- 
timators described by Tsiatis and Davidian can be 
extended to surveys. Third, if we move on to causal 
inference, we must address the thorny issue of the 
inestimable partial correlation between the poten- 
tial outcomes. Any estimator of an average causal 
effect makes a working assumption about this cor- 
relation (e.g., setting it to zero), but a standard error 
computed from an influence-function sandwich may 
still perform well when this working correlation is 
incorrect. 

Tsiatis and Davidian mention that our estimator 
fj'-K-cov, which incorporates propensity-related basis 
functions into the OLS procedure, is not consistent 
under Mi U A^/j unless the conditional mean of yi 
happens to be a linear combination of the particular 
basis functions for vTj used in the OR model. This is 



certainly true for the usual asymptotic sequence in 
which the number of basis functions remains fixed 
as n — > oo. But if we allow the basis to grow with 
the sample size (e.g., as in a smoothing spline), then 
it may become DR (Little and An, 2004). Given a 
large sample, a good data analyst will tend to fit a 
richer model than with a small sample. If the analyst 
is allowed to build a rich OR model that corrects 
for the kind of inadequacies shown in our Figure 4, 
then the OLS procedure based on the corrected OR 
model may be as good as any DR procedure. 

We like the suggestion by Tsiatis and Davidian 
of using a hybrid estimator that combines inverse- 
propensity weighting for cases with moderate propen- 
sity and regression predictions for cases with small 
propensity, an idea echoed by van der Laan and Ru- 
bin (2006). As an alternative to a hard threshold 
5 at which the change is made, one could opt for 
a smoother transition by "weighting" each part of 
the influence function more or less depending on 
the estimated propensity. We also agree with Tsi- 
atis and Davidian that estimators in the spirit of 
fi-iT-cov deserve more consideration even though they 
are not DR over M.jU M.n in the usual asymptotic 
sequence. In the simulations of our article, we ex- 
pressed rrii as a piecewise constant function of vTj 
with discontinuities at the sample quintiles of tTj. 
Another version of fin-cov that we have found to 
work well in many situations uses a linear spline in 
ryj = log(7rj/(l — VTj)) with knots at the quintiles. 

3. RESPONSE TO TAN 

Tan's important work on regression estimators con- 
nects the theory of influence functions to ideas of 
survey regression estimators and the use of control 
variates in importance sampling. His remarks and 
propositions are very helpful for understanding the 
behavior of IPW, OR and DR methods in realistic 
settings where all the models are incorrect. 

We were initially puzzled by several of Tan's points 
but, upon further consideration, found them to be 
very insightful. He states that it is more construc- 
tive to view DR estimators as efficiency-enhanced 
versions of IPW than as bias-corrected versions of 
OR. We find both views helpful for understanding 
the nature and properties of DR methods. But, as 
he explains, there are theoretical reasons to expect 
that his carefully crafted DR estimators may lead to 
greater improvement over IPW than over a good OR 
model, because IPW is conservative whereas OR is 
aggressive. 
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We are still unsure why Tan states that IPW ex- 
trapolates explicitly whereas OR extrapolates im- 
plicitly. To us, fitting an OR model to respondents 
and using that model to predict for nonrespondents 
is a very obvious kind of extrapolation, especially 
if the leverage values for some nonrespondents are 
large relative to those of the respondents. But his 
points about extrapolation are well taken. All of our 
methods extrapolate. The assumption of ignorabil- 
ity is itself an extrapolation. 

He also points out that estimating an average causal 
effect is more subtle than simply estimating the mean 
of each potential outcome and taking the difference. 
This distinction is important in a semiparametric 
approach. A semiparametric method that is optimal 
for estimating two means may not be optimal for 
estimating the mean difference. Similarly, a method 
that is optimal for estimating a population average 
causal effect may not be optimal for estimating the 
average effect among the treated, or for estimating 
differences in average causal effects between subpop- 
ulations. As parametric assumptions about the OR 
model are discarded, it becomes important to tailor 
the estimation procedure to the estimand, which his 
regression estimators apparently do. 

In Tan's simulations, his alternative model in which 
the analyst sees Xi = (Z3 + Zi + 20)2 presents an in- 
teresting situation where OLS predicts the y^'s for 
the respondents almost perfectly (i?2?a0.99), but 
the extrapolated linear predictions for the nonre- 
spondents are biased because the unseen true values 
of Ui turn sharply away from those predictions in the 
region of low propensity. This is a perfect illustration 
of how the uncritical use of fioLS can lead us astray. 
But in this example, propensity-based diagnostics 
reveal obvious deficiencies in the linear model. Tak- 
ing the initial sample of n = 200 observations from 
our article, we fit the linear model to the respon- 
dents and a logistic propensity model to all cases 
given Xi, X2, X3, and Tan's alternative version of 
X4. A plot of the observed residuals from the y- 
model versus the estimated logit-propensities from 
the vr- model is shown in Figure 1. The loess curve 
clearly shows that the OLS predictions are biased 
in the region of high propensity (where it does not 
really matter) and in the region of low propensity 
(where it matters very much) . If we account for this 
trend by introducing the squared linear predictor 
from the logit model r}? = (xja)^ as one more covari- 
ate in the y-model, the performance of jloLS greatly 
improves. Even better performance is obtained with 



splines, which tend to predict better than ordinary 
polynomials over the whole range of jy^'s. We created 
a linear spline basis for fji with four knots located 
at the sample quintiles of fji. That is, we added the 
four covariates 

{fji-ki)+, {fji-k2) + , 

(1) 

(57—^3)+, (57—^4) + 

to the y model, where (z)+ = max(0, z) and fei, A:2, fcs, 
/c4 are the knots. Over 1000 samples, we found that 
this new version of floLS (which, in our article, we 
would have called /In-cov) performed as well as any 
of Tan's estimators in the scenario where both mod- 
els were incorrect. With n = 200, we obtained bias = 
0.16, % bias = 5.70, RMSE = 2.78 and MAE = 1.78. 
With n = 1000, we obtained bias = 0.30, % bias = 
24.6, RMSE = 1.27 and MAE = 0.88. The perfor- 
mance of Tan's regression estimators in these sim- 
ulations is impressive. The performance of fioLS is 
equally impressive if we allow the analyst to make a 
simple correction to adjust for the y-model's obvious 
lack of fit. 

4. RESPONSE TO RIDGEWAY AND 
MCCAFFREY 

Ridgeway and McCaffrey correctly observe that, 
for estimating propensity scores, there are many good 
alternatives to logistic regression. In addition to their 
work on the generalized boosted model (GBM), some 
have been estimating propensities using classifica- 
tion trees (Luellen, Shadish and Clark, 2005) and 
neural networks (King and Zeng, 2002). 







no ^ 








/ 


















\ 















' T 1 1 1 i r 



-4-3-2-10 1 2 3 

linear predictor 

Fig. 1 . Scatterplot of raw residuals from linear y -model fit to 
respondents in Tan's alternative model, versus the linear pre- 
dictors from a logistic tt -model, with local polynomial (loess) 
fit. 
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A rich propensity model should improve the per- 
formance of the weighted estimator. The advantage 
of procedures like classification trees and GBM is 
that they allow us to search through a large space of 
vr-models, accounting for the effects of many covari- 
ates and their interactions, thereby reducing bias 
in the resulting estimator regardless of how yi is 
distributed. These procedures may also reduce vari- 
ance, because, as explained by Tan, in a sequence 
of increasingly rich propensity models, the asymp- 
totic variance of an augmented IPW estimator de- 
creases to the semiparametric bound. In principle, 
one could apply similar procedures like regression 
trees to create a rich y-model. But, as Ridgeway 
and McCaffrey point out, this raises the possibility 
of data snooping. As we search through larger and 
more complicated spaces to find the best y-model, 
it becomes increasingly difficult to compute honest 
standard errors. 

Ridgeway and McCaffrey's simulations with the 
extra interaction term again reveal the dangers of 
uncritically relying on fioLS- This interaction in- 
creases the degree to which the additive and linear 
y-model is misspecified, so in this scenario we would 
expect the performance of fioLS to worsen. The final 
columns of their Tables 1 and 2 show that, when this 
interaction is present, propensity-based and DR es- 
timators strongly outperform jloLS- Using the wrong 
covariates in the propensity model does little harm 
to the fiexible GBM procedure. But one could ar- 
gue that these comparisons between GBM and jloLS 
are unfair in the following sense: They resemble a 
situation where the analyst is allowed to fit a rich 
and flexible vr-model but is given no leeway to im- 
prove the y-model. We examined many samples of 
n = 200 from this new population and found X1X2 
to be a strong and highly signiflcant predictor of y 
in every sample. If we add this one interaction to 
the y-model, the bias in fioLS nearly vanishes, and 
its RMSE becomes comparable to that of the best 
DR estimators that Ridgeway and McCaffrey tried. 
Other interactions are often significant as well. We 
have not examined the performance of jloLS when 
these other interactions are included; doing so would 
be an interesting exercise. 

Our point here is not to argue for the superiority 
of p'OLS over the DR procedures. Either can work 
well if applied carefully with appropriate safeguards. 
And either can be made to fail if we, through the 
design of a simulation, impose artificial restrictions 
that force the analyst to ignore clear evidence in the 
observed data that the procedure is flawed. 



5. RESPONSE TO ROBINS, SUED, 
LEI-GOMEZ AND ROTNITZKY 

The comments by Robins et al. contain many use- 
ful observations and helpful references. Their simu- 
lations that reverse the roles of ti and 1 — ti are 
instructive. However, in the process of arguing that 
we misunderstood the message of Bang and Robins 
(2005), they have apparently misunderstood ours. 
Their insinuations of cherry-picking might be under- 
standable if we had been arguing for the superiority 
of fj'OLSi but that is not what we have done. Quite 
honestly, we began this investigation fully expecting 
to demonstrate the benefits of dual modeling when 
neither model is exactly true. 

When Bang and Robins (2005) recommended cer- 
tain DR procedures for routine use, they did so with- 
out qualiflcations or cautionary statements. Now they 
quote a passage from another article published five 
years earlier, which Bang and Robins (2005) did not 
cite, to demonstrate that this was not what they had 
in mind. Readers cannot react to what they have 
in mind, but only to what they write. Dr. Robins 
and his colleagues are eminent researchers, and their 
statements carry considerable weight. The fact that 
they knew that these estimators sometimes misbe- 
have but failed to acknowledge it makes their blan- 
ket recommendations in 2005 even more troubling. 

For the record, we will clarify how we came up 
with our simulated example. As mentioned in our 
Section 4, we were trying to loosely mimic a quasi- 
experiment to assess the average causal effect of di- 
eting on body-mass index among adolescent girls. 
We decided beforehand that yi should be predicted 
from the observed Xi with B? ~ 0.80, as in the ac- 
tual data. We decided that the distributions of the 
estimated propensity scores should resemble those in 
our Figure 3(e), as in the actual data. We decided 
that the linear predictors from the y-model and vr- 
model should have a correlation of at least 0.5, as in 
the actual data, so that yi = Uyt / J2i U would be 
a strongly biased as an estimator of fi. We decided 
that the covariates in Xj should not be normally dis- 
tributed, but they should not be so heavily skewed 
that a data analyst would need to transform them to 
reduce the leverage of a few large values. We decided 
that Xi must be a one-to-one transformation of the 
unseen true covariates Zi over the effective support 
of the Zi (without this condition, nonresponse would 
not be ignorable given Xi). Finally, we decided that 
the linear regression of yi and the logistic regres- 
sion of ti on Xi would be misspecified to about the 
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same extent, in the sense that the correlations be- 
tween the linear predictors from each model and the 
corresponding true linear predictors would be about 
0.9. 

After considerable trial and error, we came up 
with one example that met all of these criteria. As 
we ran our simulations, we were truly surprised to 
see fioLS perform as well as it did, consistently beat- 
ing all competitors. We expected that at least some 
of the DR estimators would improve upon fioLS, but 
none did. In fact, we were tempted to look for a dif- 
ferent example that would demonstrate some of the 
benefits of DR, but we decided against it precisely 
because we wanted to avoid cherry-picking. 

As Robins et al. deconstruct our simulated exam- 
ple, they suggest that our misspecified linear model 
E{yi) = xj(3 is so close to being true that fioLS is 
virtually guaranteed to outperform all competitors. 
If that were so, then why did the DR estimators 
PwLS and jj-BC-OLS not perform as well, as those es- 
timators were given the same opportunity to take 
advantage of this nearly correct y-model? And, if 
that were so, why would fioLS perform so poorly in 
their simulations when the roles of ti and 1 — ti were 
reversed? 

The first plot in Figure 1 by Robins et al. reveals 
that (a) the model for yi given the vector of true 
covariates Zi is a linear with very high and (b) 
the nonresponse is ignorable, so that P{yi \ Zi,ti = 1) 
and P{yi \ Zi,ti = 0) are the same. This plot im- 
plies that conditions where the analyst is allowed 
to see the Zj's are unrealistic, because knowing Zi is 
essentially equivalent to knowing yi. But this plot 
says nothing about the performance of fioLS or any 
other estimator when Zi is hidden and the analyst 
sees only Xj, which is the only scenario that we have 
claimed is realistic. [In fact, the first simulated ex- 
ample published by Bang and Robins (2005) yields 
a similar picture, because their true data-generating 
mechanism is also linear and their is 0.94.] The 
conditional variance V{yi | Zi) was one of many pa- 
rameters that we had to adjust to create an example 
that satisfied all of the criteria that we have men- 
tioned. We tried to set V{yi \ Zi) to larger values, but 
doing so decreased the signal-to-noise ratio in the 
observed data to the point where we no longer saw 
meaningful biases in any estimators when n = 200. 

With their Figure 2, Robins et al. purport to show 
that our misspecified linear regression model fits so 
well that the predicted values xj (3 are essentially 



unbiased predictions of the missing yj's, which guar- 
antees excellent performance for fioLS- They state, 
"We can see that the predicted values of the nonre- 
spondents are reasonably centered around the straight 
line even for those points with predicted values far 
from the predicted values of the respondents." On 
the contrary, our linear model E{yi) = xj f5 does not 
give unbiased predictions for nonrespondents or re- 
spondents, especially not in the region of extrapola- 
tion. To illustrate, we took one simulated sample of 
n = 1000 observations, regressed on Xi among the 
respondents, and computed the regression predic- 
tions xJ [5 and residuals yi — xJ f5 for both groups. 
A plot of the residuals versus the regression pre- 
dictions is displayed in Figure 2, along with local 
polynomial (loess) trends. Respondents are shown 
in black, and nonrespondents are shown in gray. 
(For visual clarity, only 20% of the points are dis- 
played, but the loess trends are estimated from the 
full sample.) For each group, the least-squares re- 
gression model strongly underpredicts near the cen- 
ter and overpredicts at the extremes. The reason 
why jloLS performs well in this example is not that 
the linear model is approximately true, but that the 
positive and negative residuals in the nonrespondent 
group approximately cancel out. The average value 
of yi — xJ (5 for respondents is exactly zero (a conse- 
quence of OLS), and the average value of yi — xJ (3 
for nonrespondents is close to zero. Over 1000 sim- 
ulated samples, the average of yi — xJ (3 among non- 
respondents was 1.68. Multiplying this by —0.5 (be- 
cause the average nonresponse rate is 50%) gives 
—0.84, the estimated bias for jloLS reported in our 
Table 3. 

Figure 2 also reveals why (xoLS was not beaten 
in this example by any of the dual-modeling meth- 
ods. The differences between the two loess curves in 
Figure 2 are not large, showing that the OLS predic- 
tions have similar patterns of bias for respondents 
and nonrespondents. When the predictions from a 
y-model are biased, and the biases are similar when 
ti = l and ti = 0, they are not easily corrected by an 
estimated propensity model. 

If we reverse the roles of ti and 1 — ti , as Robins 
et al. have done, the situation dramatically changes. 
Taking the same sample of n = 1000, we regressed 
yi on Xi when tj = and predicted the responses for 
both groups. Residuals versus predicted values from 
this reverse fit are shown in Figure 3. (Once again, 
for visual clarity, only 20% of the sampled points 
are shown, but the loess trends are estimated from 
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Fig. 2. Residuals versus predicted values for respondents 
(ti = l) (black dots) and nonrespondents (ti—Q) (gray dots) 
from one sample of n — 1000 from our original simulation, 
with local polynomial (loess) trends for each group. For visual 
clarity, only 20% of the sampled points are shown. 
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Fig. 3. Plot analogous to Figure 2, with the roles of ti and 
1 — ti reversed. Cases with ti = and ti = 1 are denoted by 
black and gray dots, respectively, with local polynomial (loess) 
trends shown for each group. For visual clarity, only 20% of 
the sampled points are shown. 

the full sample.) For the = group, the linear 
model underpredicts at the center and overpredicts 
at the extremes, and the average value = xj 13 is 
zero. But for the t « = 1 group, the linear model con- 
sistently overpredicts across the entire range, intro- 
ducing a strong upward bias into jloLS- 

This alternative simulation by Robins et al. is a 
classic example where patterns of bias in a linear 
y-model cause fioLS to perform poorly. But because 
the patterns are dramatically different when i j = 
and = 1, it is also a classic example where the fail- 
ure can be readily diagnosed and corrected by fitting 
a vr-model. A plot of the residuals yi — xf/3 for the 
ti = group versus the linear predictors from a lo- 
gistic propensity model is shown in Figure 4. The 



plot, which is based only on {xi,ti,{l — ti)yi), shows 
a strong tendency for the linear y-model to overpre- 
dict when P{ti = 1) is low or high. To correct this 
bias, we created a spline basis as in expression (1), 
with knots at the sample quintiles, and included the 
four extra terms as predictors in the linear ?/-model. 
The performance of fioLS (which we would now call 
fi"K-cov) improved dramatically, and the new estima- 
tor worked better than any of the dual-modeling 
methods reported by Robins et al. The performance 
statistics in the both-models-wrong scenario were 
Bias = 2.21, Var. = 12.61 and MSE = 17.46 when 
n = 200, and Bias = 2.40, Var. = 1.88, and MSE = 
7.66 when n = 1000, which compare favorably to the 
results shown by Robins et al. in their Table 2. 

6. CONCLUDING REMARKS 

As statisticians devise newer and fancier methods, 
we hope to find one that is foolproof, yielding good 
results no matter when and how it is applied. But 
the search for a foolproof method is quixotic and 
futile. Some procedures are, on balance, better than 
others, but each one requires many subjective in- 
puts, and none should be applied routinely or uncrit- 
ically. As we develop better estimators, we should 
also strive to give potential users a healthy dose of 
intuition about how the procedures work, their lim- 
itations, sound recommendations about their use, 
and diagnostics that can help users decide when a 
procedure is trustworthy and when it is not. 

In conclusion, we believe that propensity modeling 
is prudent and even necessary when rates of missing 
information are high. But we are still not convinced 
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Fig. 4. Scatterplot of residuals from a linear y-model fit to 
ti — cases, versus linear predictors from a logistic iv-model, 
with local polynomial (loess) fit in one sample of n — 1000 
from the alternative simulation study by Robins et al. 
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that estimated inverse propensities must always be 
used as weights. 
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